Abstract

The increasing availability of corpora anno- 
tated for linguistic structure prompts the 
question: if we have the same texts, anno- 
tated for phrase structure under two dif- 
ferent schemes, to what extent do the an- 
notations agree on structuring within the 
text? We suggest the term tree alignment 
to indicate the situation where two markup 
schemes choose to bracket off the same text 
elements. We propose a general method for 
determining agreement between two anal- 
yses. We then describe an efficient im- 
plementation, which is also modular in 
that the core of the implementation can be 
reused regardless of the format of markup 
used in the corpora. The output of the 
implementation on the Susanne and Penn
treebank corpora is discussed. 

1 Introduction 

We present here a general design for, and mod- 
ular implementation of, an algorithm for comput- 
ing areas of agreement between structurally anno- 
tated corpora. Roughly speaking, if two corpora 
bracket off the same stretches of words in their struc- 
tural analysis of a text, the corpora agree that that 
stretch of text should be considered a single unit at 
some level of structure. We will (borrowing a usage
from Church and Gale, 1993) term this agreement
(sub)tree alignment.

We make the following assumptions, which appear 
reasonable for markup schemes with which we are 
familiar: 

• the "content" of each text consists of a sequence 
of "terminal" elements. That is, the content is 
a collection of elements generally corresponding
to words and punctuation, and this will be
roughly constant across the two corpora. It may
also contain additional elements to represent, 
for example, the positing of orthographically 
null categories. 

• the two corpora whose trees are to be aligned 
contain identifiable structural markup. That is, 
structural "delimiters" are distinct from other 
forms of markup and content. 

• two corpora agree on an analysis when they 
bracket off the same content. 

• The corpora may contain additional markup 
provided this is distinct from content and struc- 
tural markup. 

Our goal, then, is to determine those stretches of 
a text's content which two corpora agree on. Why 
might we want to do this? There are several reasons: 

• increase confidence in markup and determine 
areas of disagreement 

If two or more corpora agree on parts of an anal- 
ysis, one may "trust" that choice of grouping 
more than those groupings on which the cor- 
pora differ. Alignment can be used to detect 
disagreements between manual annotators. 

• verify preservation of analyses across multiple 
versions of a corpus 

If all the subtrees of a corpus are aligned with 
those of another, then the second is consistent 
with the first, and represents analyses at least 
as detailed as those in the first. Such automatic 
checking will be useful both in the case of man- 
ual edits to a corpus, and also in the case where 
automatic analysis is performed. 

• import markup from one corpus to another 

If one corpus contains "richer" information than 
another, for example in terms of annotation
of syntactic function or of lexical category, the
markup from the first may be interpreted with 
respect to analyses in the second. 

• determine constant markup transformations 

Having identified aligned subtrees, the labels of 
a pair of trees may be recorded, and the results 
for the pair of corpora analysed to determine 
consistent differences in markup. 

• determine constant tree transformations 

A set of pairings between aligned subtrees can 
be used as a bootstrap for semi-automatic 
markup of corpora. 

We can also identify some specific motivations 
and applications. First, in the automatic determina- 
tion of subcategorization information, confidence in 
the choice of subcategorization may be improved by 
analyses which confirm that subcategorization from 
other corpora. Second, the algorithm we have devel- 
oped is robust in the face of minor editorial differ- 
ences, choice of markup for punctuation, and overall 
presentation of the corpora. We have processed the
Susanne corpus (Sampson, 1995) and Penn treebank
(Marcus et al., 1993) to provide tables of word and
subtree alignments. Third, on the basis of the computed
alignments between the two corpora, and the
tree transformations they imply, the possibility is 
now open to produce, semi-automatically, versions 
of those parts of the Brown corpus covered by the 
Penn treebank but not by Susanne, in a Susanne- 
like format. Finally, in the development of phrasal 
parsers, our results can be used to obtain a measure 
of how contentious the analysis of different phrase 
types is. 

Obviously, the utility of algorithms such as the 
one we present here is dependent on the quality and 
reliability of markup in the corpora we process. 

2 The Task 

In this section, we provide a general characterization 
of agreement in analysis between two corpora. 

We assume the existence of two corpora, $C^l$ and
$C^r$.¹ The content of each corpus is a sequence of
elements drawn from a collection of terminal elements,
markers for the left and right structural delimiters
(LSD and RSD, respectively) and possibly
other markup irrelevant to the content of the text or
its structural analysis. Occurrences of structural delimiters
are taken to be properly nested. We assume
only that the terminal elements of some corpus can
be determined, and not that the definition of terminal
element corresponds to some notion of, say, word.
A consequence of this is that markers in a corpus for
empty elements may be retained, and operated on,
even if such markers are additional to the original
text, and represent part of a hypothesis as to the
text's linguistic organization.

¹ $l$ and $r$ for left and right.

The following sequences can then be computed 
from each corpus: 

$W^{\{l,r\}} = W_1 \ldots W_n$    terminal elements
$S^{\{l,r\}} = S_1 \ldots S_m$    terminal elements and structural delimiters

So S is the corpus retaining structural annotation, 
and W is a "text only" version of the corpus. As each 
of these is a sequence, we can pick out elements of 
each by an index, that is, $W^l_n$ will pick out the $n$th
terminal element of the left corpus.

The following definitions allow us to refer to struc- 
tural units (subtrees) within the two corpora. (We 
omit the superscript indicating which corpus we are 
dealing with.) 

Numbering subtrees We number the subtrees in
each corpus as follows. If $S_i$ is the $t$th occurrence of
LSD in $S$ and $S_j$ is the matching RSD of $S_i$, then the
extent of subtree $t$ of $S$ is the sequence $S_i \ldots S_j$.
The terminal yield of a subtree is then its extent
less any occurrences of LSD and RSD. This can be
conveniently represented as the stretch of terminal
elements included within a pair of structural delimiters,
i.e.

$\mathrm{yield}(t) = (k, l)$

where $W_k$ is the first terminal element in the extent of $t$ and
$W_l$ the last. We'll refer to a subtree's number as its
index. Let Subtrees(C) be the set of yields in C.
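The numbering scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `subtree_yields` is hypothetical, the corpus is assumed to be already reduced to a list of tokens, and the delimiters are assumed to be written `{` and `}`. Subtrees are numbered by the order in which their LSD occurs, and each yield is a pair of 1-based terminal indices.

```python
from typing import Dict, List, Tuple

LSD, RSD = "{", "}"  # assumed delimiter tokens; any distinct pair would do


def subtree_yields(stream: List[str]) -> Dict[int, Tuple[int, int]]:
    """Map each subtree index t to its yield (k, l), using 1-based terminal indices."""
    yields: Dict[int, Tuple[int, int]] = {}
    open_stack: List[Tuple[int, int]] = []  # (subtree index, terminals seen at its LSD)
    next_index = 0
    terminals_seen = 0
    for token in stream:
        if token == LSD:
            next_index += 1  # subtrees are numbered by order of their LSD
            open_stack.append((next_index, terminals_seen))
        elif token == RSD:
            if not open_stack:
                raise ValueError("RSD without a matching LSD")
            t, seen_at_open = open_stack.pop()
            if terminals_seen > seen_at_open:  # ignore subtrees with no terminals
                yields[t] = (seen_at_open + 1, terminals_seen)
        else:
            terminals_seen += 1  # anything else counts as a terminal element
    if open_stack:
        raise ValueError("LSD without a matching RSD")
    return yields
```

Because indices are assigned at the LSD, a subtree always has a lower index than any subtree it dominates, which is the property the corollaries below rely on.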

Two corollaries The following result will be useful
later on: for two subtrees from a corpus, if $t < t'$
then either $t'$ is a subtree of $t$ or there is no dominance
relation between $t$ and $t'$.

Likewise, we claim that, if a subtree is greater
than unary branching, then it is uniquely identified
by its yield. To see this, suppose that there are two
distinct subtrees, $t, t'$ such that
$\mathrm{yield}(t) = \mathrm{yield}(t') = (i, j)$. Then, no terminal element intervenes
between $W_i$ and $t$'s LSD, or between $W_j$ and $t$'s
RSD, and the same condition holds of $t'$. It must
therefore follow that $t$ is a subtree of $t'$ or vice versa
and that they are connected by a series of only unary
branching trees.

Alignment of terminal elements We want to
compute the minimal set of differences between $W^l$
and $W^r$, i.e. a monotone, bijective partial function
$\delta$ defined as follows:²

Let $\delta$ be the largest subset of $i \times j$ for
$1 \le i \le \mathrm{length}(W^l)$ and $1 \le j \le \mathrm{length}(W^r)$ such
that $\delta$ is monotone and bijective, and

$\delta(i) = j$ if either $W^l_i = W^r_j$
or $1 < i < \mathrm{length}(W^l)$,
$W^l_{i-1} = W^r_{j-1}$,
$1 < j < \mathrm{length}(W^r)$,
and $W^l_{i+1} = W^r_{j+1}$.

In other words, $\delta$ records exact matches between the
left and right corpora, or mismatches involving only
a single element, with exact matches to either side.
This allows minor editorial differences and choice of
markup for terminal elements to have no effect on
overall alignment.
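As a rough illustration of this definition, the sketch below builds a mapping of the same shape using Python's standard `difflib.SequenceMatcher`. This is a stand-in chosen for self-containment, not the method used in the implementation described later, and `difflib` does not guarantee a minimal set of differences. The function name `align_terminals` is hypothetical. Indices are 1-based, as in the definition.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def align_terminals(left: List[str], right: List[str]) -> Dict[int, int]:
    """Approximate delta: exact matches plus flanked single-element mismatches."""
    delta: Dict[int, int] = {}
    matcher = SequenceMatcher(None, left, right, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Exact matches: pair the elements off one by one.
            for offset in range(i2 - i1):
                delta[i1 + offset + 1] = j1 + offset + 1
        elif tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            # A single mismatched element is accepted only when the
            # elements immediately before and after it match exactly.
            flanked = (
                i1 > 0 and j1 > 0
                and i2 < len(left) and j2 < len(right)
                and left[i1 - 1] == right[j1 - 1]
                and left[i2] == right[j2]
            )
            if flanked:
                delta[i1 + 1] = j1 + 1
    return delta
```

For instance, aligning `the jury say Friday` with `the jury said Friday` pairs every position with itself, including the mismatched third element, whereas a mismatch at the very start or end of a sequence is left unpaired.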

Aligned subtrees We now offer the following definition.
Two trees in $C^l$ and $C^r$ are aligned if they
share the same yield (under the image of $\delta$), i.e.:

$(W^l_i, W^l_j) \in \mathrm{Subtrees}(C^l)$ and
$(W^r_{\delta(i)}, W^r_{\delta(j)}) \in \mathrm{Subtrees}(C^r)$

Two subtrees are strictly aligned if the above conditions
hold and neither tree is a unary branch.
(This definition will be extended shortly.) We saw
above that, if a tree is not unary branching, then it is
uniquely identified by its yield.

Unary branching In the case of unary branching,
the inverse of yield will not be a function. In
other words, two distinct subtrees may have the same yield. The
situation is straightforward if both corpora share the 
same number of unary trees for some yield: we can 
pair off subtrees in increasing order of index. (Re- 
call that, under dominance, a higher subtree index 
indicates domination by a lower index.) In this case 
we will say that the unary trees in question are also 
strictly aligned. 

If the two corpora differ on the number of unary 
branches relating two nodes, there is no principled 
way of pairing off nodes, without exploiting more 
detailed, and probably corpus- or markup-specific 
information about the contents of the corpora. 
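The pairing rule for unary chains can be stated compactly in code. The sketch below is illustrative only; the function name `pair_same_yield` is hypothetical, and it assumes that the subtree indices sharing a given yield have already been collected for each corpus.

```python
from typing import List, Optional, Tuple


def pair_same_yield(
    left_indices: List[int], right_indices: List[int]
) -> Optional[List[Tuple[int, int]]]:
    """Pair subtrees that share one yield, or return None if the counts differ."""
    if len(left_indices) != len(right_indices):
        # Differing numbers of unary branches: no principled pairing exists
        # without corpus- or markup-specific information.
        return None
    # Lower index dominates higher index, so sorting pairs the trees top-down.
    return list(zip(sorted(left_indices), sorted(right_indices)))
```

A `None` result corresponds to the case where only a potential alignment between two sequences of subtrees can be reported.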

² Of course, in the general case, such a function may
not be unique. It seems a reasonable assumption in the
case of substantial texts in a natural language that the
function will be unique (although perhaps empty).

Linking to original corpus For each of the corpora
we assume we can define two functions: one,
terminal location, will give the location in the original
corpus of a terminal element (e.g. a function
from terminal indices to, say, byte offsets in a file),
and the other, tree location, will give the location in
the original corpus of a subtree (in terms, say, of
byte offsets of the left and right delimiters). Tree
locations will therefore include any additional information
within the corpus stored between the left and
right delimiters.
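A tree location function of this kind might look as follows for a corpus whose delimiters are single bytes. This is a sketch under assumptions: the name `tree_locations` is hypothetical, parentheses are assumed as the delimiters, and the delimiter bytes are assumed never to occur as content.

```python
from typing import Dict, List, Tuple


def tree_locations(
    raw: bytes, lsd: bytes = b"(", rsd: bytes = b")"
) -> Dict[int, Tuple[int, int]]:
    """Map each subtree index to the byte offsets of its left and right delimiters."""
    locations: Dict[int, Tuple[int, int]] = {}
    open_stack: List[Tuple[int, int]] = []  # (subtree index, offset of its LSD)
    next_index = 0
    for offset in range(len(raw)):
        byte = raw[offset:offset + 1]
        if byte == lsd:
            next_index += 1
            open_stack.append((next_index, offset))
        elif byte == rsd:
            if not open_stack:
                raise ValueError("right delimiter without a matching left delimiter")
            index, start = open_stack.pop()
            locations[index] = (start, offset)
    return locations
```

Slicing the original bytes from the first offset to the second (inclusive) recovers the subtree together with any labels or other markup stored between its delimiters.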

Output of the procedure The following information
may be output from this procedure in the
form of tables:

• of pairs of subtree indices, indicating strict alignment of
two trees;

• of pairs of sequences of subtree indices,
indicating potential alignment;

• of pairs of terminal element indices (i.e. the
function $\delta$);

• of single terminal element mismatches, for later
processing to detect consistent differences in
markup; and

• of the results of applying the functions terminal
location and tree location to the relevant information
above.

This output can be thought of as a form of "stand 
off" annotation, from which other forms of informa- 
tion about the corpora can be derived. 

3 A portable implementation 

In this section we describe an implementation of the
above procedure which abstracts away from details
of the markup used in any particular corpus. The
overall shape of the implementation is shown in Figure 1.
The program described here is implemented
in Perl.

Normalization We can abstract away from de- 
tails of the markup used in a particular corpus by 
providing the following externally defined functions. 

annotation removal and transformation 

As our procedure works only in terms of ter- 
minal elements and structural annotation, all 
other information may be removed from a cor- 
pus before processing. We also take this oppor- 
tunity to transform the LSD and RSD used in 
the corpus into tokens used by the core proces- 
sor (that is, { and } respectively). We may also 
choose at this point to normalize other aspects 
of markup known to consistently differ between 
the two corpora. 



Figure 1: Overall view of processing (diagram label: "Normalize to token stream")



terminal and tree locations Similarly, separate
programs may be invoked to provide tables of
byte offsets of terminals and start- and end-points
of trees.
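To make the normalization step concrete, here is a minimal sketch for a parenthesized, labelled bracketing of the general kind used by the Penn treebank. It is an illustration, not the Perl code described in this paper: the function name `normalize_bracketed` is hypothetical, and it assumes that a category label, when present, is the token immediately following an opening parenthesis.

```python
import re
from typing import List, Optional


def normalize_bracketed(text: str) -> List[str]:
    """Reduce a labelled bracketing to a stream of '{', '}' and terminal tokens."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    stream: List[str] = []
    previous: Optional[str] = None
    for token in tokens:
        if token == "(":
            stream.append("{")
        elif token == ")":
            stream.append("}")
        elif previous != "(":
            stream.append(token)  # a terminal element
        # Otherwise the token directly follows "(" and is a label: drop it.
        previous = token
    return stream
```

Under this scheme a preterminal such as `(DT The)` becomes the unary tree `{ The }`, which matches the later observation that the treebank marks the relation between preterminal and terminal as a unary tree.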

With these functions in place, we proceed to the 
description of the core algorithm. 

Computing minimal differences We use the
program diff and interpret its output to compute
the function $\delta$. Specifically, we use the Free Software
Foundation's gdiff with the options --minimal,
--ignore-case and --ignore-all-space, to guarantee
optimal matches of terminals and to allow for
editorial decisions that result in differences in capitalization.
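One way of interpreting diff output is sketched below. It assumes that each corpus has been written out with one terminal element per line and that diff was run in its default ("normal") output format, where each hunk is introduced by a command such as `3c3`, `0a1` or `5,7d4`. The function name `delta_from_diff` and the calling convention are hypothetical. Lines outside any hunk are paired in order, and a one-line change is kept only when it lies strictly inside both files, so that it is flanked by matching lines.

```python
import re
from typing import Dict, Iterable

# Hunk commands in diff's normal format, e.g. "3c3", "0a1", "5,7d4".
HUNK = re.compile(r"^(\d+)(?:,(\d+))?([acd])(\d+)(?:,(\d+))?$")


def delta_from_diff(diff_lines: Iterable[str], n_left: int, n_right: int) -> Dict[int, int]:
    """Build a 1-based mapping from left to right line numbers from diff hunks."""
    delta: Dict[int, int] = {}
    i, j = 1, 1  # next unprocessed line in the left and right files
    for line in diff_lines:
        m = HUNK.match(line.rstrip("\n"))
        if m is None:
            continue  # "<", ">" and "---" lines carry content, not positions
        l1, l2, cmd, r1, r2 = m.groups()
        ls, le = int(l1), int(l2 or l1)
        rs, rend = int(r1), int(r2 or r1)
        if cmd == "a":      # nothing changed on the left: empty range after line l1
            ls, le = ls + 1, ls
        elif cmd == "d":    # nothing changed on the right: empty range after line r1
            rs, rend = rs + 1, rs
        while i < ls:       # lines before the hunk are common to both files
            delta[i] = j
            i, j = i + 1, j + 1
        if cmd == "c" and ls == le and rs == rend and 1 < ls < n_left and 1 < rs < n_right:
            delta[ls] = rs  # single-element mismatch with matches on either side
        i, j = le + 1, rend + 1
    while i <= n_left and j <= n_right:  # common lines after the last hunk
        delta[i] = j
        i, j = i + 1, j + 1
    return delta
```

For example, the hunk `3c3` between two four-line files yields the identity mapping on lines 1 to 4, with line 3 included as a tolerated single mismatch.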

Subtree indexing and alignment detection We use
the following representation of subtrees
for the time-efficient detection of aligned trees.
Trees in the right corpus (which we can think of
as the target) are represented as elements in a hash
table, whose key is computed from the terminal indices
of the start and end of its yield. Each element
in the hash table is a set of numbers, to allow for
the hashing of multiple unary trees to the same cell
in the table.

In processing the subtrees for the left corpus, we
can simply check whether there is an element in the
hash table for the terminal indices of the yield of
the tree in the left corpus under the image of the
function $\delta$.
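The hash-table lookup can be sketched with an ordinary Python dictionary keyed on yields. The names `index_right_trees` and `find_aligned` are hypothetical, and the inputs are assumed to be mappings from subtree index to yield, together with the terminal alignment as a dictionary from left to right terminal indices.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Yield = Tuple[int, int]


def index_right_trees(right_yields: Dict[int, Yield]) -> Dict[Yield, Set[int]]:
    """Hash each right-corpus yield to the set of subtree indices that share it."""
    table: Dict[Yield, Set[int]] = defaultdict(set)
    for index, span in right_yields.items():
        table[span].add(index)  # several unary trees may land in the same cell
    return table


def find_aligned(
    left_yields: Dict[int, Yield],
    table: Dict[Yield, Set[int]],
    delta: Dict[int, int],
) -> Dict[int, Set[int]]:
    """For each left subtree, return the right subtrees with the same yield under delta."""
    aligned: Dict[int, Set[int]] = {}
    for index, (start, end) in left_yields.items():
        if start in delta and end in delta:  # both endpoints must be aligned terminals
            image = (delta[start], delta[end])
            if image in table:               # membership test does not create a cell
                aligned[index] = set(table[image])
    return aligned
```

Each left subtree costs one dictionary lookup, so the detection pass is linear in the number of left subtrees once the table has been built.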

4 An example 

In this section we give a brief example to illustrate
the operations of the algorithm. The start of the 
Susanne corpus is shown in the table here: 

    the      [O[S[Nns:s.
    Fulton   [Nns.
    county   .Nns]
    grand    .
    jury     .Nns:s]
    say      [Vd.Vd]
    Friday   [Nns:t.Nns:t]

while the corresponding part of the treebank looks 
as follows. 

    ( (S
        (NP (DT The) (NNP Fulton) (NNP County)
            (NNP Grand) (NNP Jury) )
        (VP (VBD said)
            (NP (NNP Friday) )

The process of numbering the terminal elements 
and computing the set of minimal differences will 
give rise to a normalized form of the two corpora 
something like the following, where the two leftmost 
columns come from Susanne, the others from Penn. 
(The numbers here have been altered slightly for the 
purposes of exposition.) 

    Susanne word   position   Penn word   position
    the            2          the         1
    Fulton         3          Fulton      2
    County         4          County      3
    Grand          5          Grand       4
    Jury           6          Jury        5

Note that the function $\delta$ will in this case map 2
to 1, 3 to 2 and so on. Note also that the whole of this
sequence of words is bracketed off in both corpora.
Accordingly, we will record the existence of a tree
spanning 1 to 5 in the treebank. The alignment of
the corresponding tree from Susanne will be detected
by noting that $\delta(2) = 1$ and $\delta(6) = 5$.

5 Results of processing on two corpora

We have processed the entire Susanne corpus and the 
corresponding parts of the Penn treebank, and pro- 
duced tables of alignments for each pair of marked- 
up texts. Inputs for this process were a Susanne 
file and the corresponding "combined" file from the 
treebank (i.e. including part-of-speech information). 
Recalling that the treebank marks up the relation- 
ship between pre-terminal and terminal as a unary 
tree (and that Susanne doesn't do this), the treebank 
regularly contains more trees than Susanne. 

First, a definition: a tree is maximal if it is not 
part of another tree within a corpus. We ignore max- 
imal trees of depth one in both corpora (as these cor- 
respond to indications of textual units rather than 
sentence-internal structural markup). Each maxi- 
mal tree containing a tree of greater than depth one 
in the treebank may also contain sentence punctua- 
tion which is treated within the structural markup. 
As such markup is typically treated as external to 
structural annotations within Susanne, trees con- 
taining a sentence and sentence punctuation cannot 
be a possible target for alignment across the two 
corpora. We can take the number of maximal trees 
of depth more than one within Susanne as an indi- 
cation of the number of trees within the treebank 
which are unalignable as a consequence of decisions 
about markup. This figure comes to 2431. 

With those considerations, we report the following 
findings: 



• There are 156584 terminal elements in Susanne 
and of those we find a total of 145583 (93%) for 
which a corresponding element is identified in 
the treebank. The corresponding figure for the 
treebank is 86% (of 169782 terminal elements 
in the treebank). 

• There are 110484 trees in Susanne (including
1952 maximal trees of depth one) and so a total
of 108532 potentially aligned trees. Of these
76011 (70%) are aligned with trees in the treebank.

• There are 301086 trees in the treebank, of
which we can eliminate 169782 as trees indicating
preterminals (which includes 122174 containing
just a textual delimiter), and an estimated
further 2431 as representing trees including
sentence punctuation. This gives a total of
128873 trees in the treebank possibly aligned with
those in Susanne, of which 76011 (59%) are in fact
aligned.
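The percentages and totals in the list above can be rechecked with a few lines of arithmetic. The variable names are chosen here for illustration. Two assumptions are made: that the treebank's 86% is taken over the same 145583 matched terminals (the terminal alignment being bijective), and that the 59% is the 76011 aligned trees taken over the treebank's candidate trees.

```python
def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Counts as reported in the findings.
susanne_terminals = 156584
treebank_terminals = 169782
matched_terminals = 145583
susanne_trees = 110484
susanne_depth_one_maximal = 1952
aligned_trees = 76011
treebank_trees = 301086
treebank_preterminal_trees = 169782
treebank_punctuation_trees = 2431  # estimated from Susanne's maximal trees

# Trees that could in principle be aligned, after the exclusions described.
susanne_candidates = susanne_trees - susanne_depth_one_maximal
treebank_candidates = (
    treebank_trees - treebank_preterminal_trees - treebank_punctuation_trees
)

terminal_coverage_susanne = percent(matched_terminals, susanne_terminals)
terminal_coverage_treebank = percent(matched_terminals, treebank_terminals)
tree_coverage_susanne = percent(aligned_trees, susanne_candidates)
tree_coverage_treebank = percent(aligned_trees, treebank_candidates)
```

Under these assumptions the computed values agree with the reported figures: 108532 and 128873 candidate trees, and coverage of 93%, 86%, 70% and 59% respectively.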

The figures above bear out the impression that 
trees in the Penn treebank are more highly artic- 
ulated than those in Susanne, even leaving aside 
the additional structure induced by the treatment 
of punctuation and preterminals in the treebank. 

The entire process of computing the above out- 
put completes in approximately fifty minutes on an 
unloaded Sun SparcStation 20. 

6 Conclusions and Limitations 

We have seen above a formal characterization and 
implementation of an algorithm for determining the 
extent of agreement between two corpora. The core 
algorithm itself and output formats are completely 
independent of the markup used for the different cor- 
pora. The alignments computed for the Susanne cor- 
pus and corresponding portion of the Penn treebank 
have been presented and discussed. 

Having computed the alignment of trees across
corpora, one option is to compute (either explic- 
itly or in some form of stand-off annotation) a cor- 
pus combining the information from both sources, 
thereby allowing the use of the distinctions made by 
each corpus at once. 

There are many future experiments of obvious in- 
terest, particularly those to do with examining po- 
tential factors in cases of agreement or disagreement: 

• analysis of consistency of annotation by markup 
label 

Certain phrase types may be more consistently 
annotated than others, so that we can be more 
confident in our analyses of such phrases. 



• analysis of consistency of annotation by depth 
in tree 

From the above discussion we can see that 
alignment of maximal trees approximates 100%,
while that for terminals approximates 90%. 
Therefore (and unsurprisingly) the bulk of dis- 
agreement lies somewhere in between. Is that 
disagreement evenly distributed or are there 
factors to do with the complexity of analysis 
at play? 

These proposals have to do essentially with formal 
aspects of markup. Other, perhaps more interesting 
questions, touch on the linguistic content of anal- 
yses, and whether for example particular linguistic 
phenomena are associated with divergence between 
the corpora. 

The assumption that trees within corpora are 
strictly nested represents an obvious limitation on 
the scope of the algorithm. In cases where markup 
is more complex, other strategies will have to be 
developed for detecting agreement between corpora. 
That said, the class of markup for which the algorithm
presented here is applicable is very large,
including, perhaps most importantly, normalized forms
of SGML (Goldfarb, 1990), for example that proposed
by Thompson and McKelvie (1996).